Method and apparatus for testing the responsiveness of a network device

ABSTRACT

Method and apparatus for fault management of computer networks which utilizes a proxy or recruit network device to test the responsiveness of a network device. When a first network device loses contact with a second network device, the first network device asks a proxy network device to attempt to reach the second network device; the proxy reports back to the first network device whether the contact attempt was successful. The proxy network device may contact the second network device through a different path and/or protocol than that used by the first network device.

FIELD OF THE INVENTION

[0001] This invention relates to fault management of computer networks and, more particularly, to a method and apparatus wherein a first network device employs a proxy or recruit network device to test the responsiveness of another network device.

BACKGROUND OF THE INVENTION

[0002] Networks provide increased computing power, sharing of resources and communications between users. A network may include a number of computer devices within a room, building, or site that are interconnected by a high speed local data link to form a local area network (LAN), such as a token ring network, Ethernet network, or the like. LANs in the same or different locations may be interconnected by different media and protocols such as packet switching, microwave links and satellite links to form a wide area network. There may be several hundred or more interconnected devices in a network.

[0003] As a network becomes larger and more complex, issues arise as to the amount of traffic on the network, utilization of resources, security and the isolation of network faults. In U.S. Pat. No. 5,436,909, which issued to Roger Dev et al. on Jul. 25, 1995, and which is herein incorporated by reference in its entirety, a system for isolating network faults is disclosed. In the '909 patent, a network management system models network devices and relations between network devices. A contact status of each device is contained in a corresponding model. Each model receives status updates from and/or regularly polls the corresponding network device.

[0004] The '909 patent uses a technique known as “status suppression” in order to isolate network faults. When a first network device has lost contact with its corresponding model, the models which correspond to network devices adjacent to the first network device are polled to see if they have also lost contact with their corresponding network devices. If the adjacent models cannot contact their corresponding network devices, then presumably the first network device is not the cause of the fault and a fault status in the first model is suppressed or overridden. If it is determined that all adjacent network devices are not communicating, then the network fault can be more easily determined as something common to all of these devices.

[0005] It may be advantageous to focus the failure analysis on the first network device without polling all of the adjacent network devices. In some large networks, such polling could involve hundreds, possibly thousands, of network devices thereby increasing the amount of traffic on the network and degrading network performance. In addition, there may be network devices that, although they have lost contact with the network management system, are still in contact with some other network device.

[0006] It is an object of the present invention to provide a method to facilitate fault management in a network which can be used alone or together with other fault management services to deduce the location and/or cause of a network failure.

SUMMARY OF THE INVENTION

[0007] The present invention relates to a method and apparatus for determining the responsiveness of a network device through the use of proxy or recruit network devices. More specifically, when a first network device has lost contact with a second network device, a proxy device is recruited to attempt to contact the second network device. Typically, this recruit utilizes a different physical path to the second network device and/or a different communication protocol for contacting the second device. The recruit then reports on whether the contact was successful. If it was successful, then the first network device can infer that the cause of its contact loss may lie with its path to the second network device or with the protocol the first device uses to contact the second device.

[0008] In one embodiment, a list of potential recruits is maintained at one or more locations in the network. Then, when a first network device loses contact with a second network device, one or more recruits from the list can be selected to attempt to contact the second network device. Where a plurality of recruits are selected, the recruits may attempt to contact the second device either in series or in parallel. The recruits then report back the results of their attempts, from which a better understanding of the location and/or cause of the network failure may be determined. This method may be used alone or in combination with other fault management services. It may advantageously be used in conjunction with a network management platform, such as the SPECTRUM® management system, available from Cabletron Systems, Inc., Rochester, N.H., which models the various devices (i.e., physical devices and applications) on the network, and maintains a contact status for each such device.

[0009] These and other advantages of the present invention will be understood from the following drawings and detailed description of an exemplary embodiment.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 is a block diagram of a network management system overseeing a network, which management system may incorporate the present invention;

[0011] FIG. 2 is a flow chart illustrating an example of the operation of a network management system which utilizes the fault services of the present invention in accordance with one embodiment;

[0012] FIG. 3 is a flow chart of the fault management service according to another embodiment;

[0013] FIG. 4 is a schematic representation of a network illustrating the use of a recruit or proxy network device to contact a second network device which has lost contact with a first network device (the network management system);

[0014] FIG. 5 is a schematic representation of a network for illustrating an exemplary use of the present invention; and

[0015] FIG. 6 shows a general purpose computer as one example of implementing the present invention.

DETAILED DESCRIPTION

[0016] A block diagram of an overall system according to the present invention is shown in FIG. 1. A network 106 includes a plurality of interconnected network devices (not shown). A network management system 100 communicates with the network 106 to maintain the network in operating condition and to monitor the operations of the network. The network management system 100 is coupled to a database manager 104 which manages the storage and retrieval of disk-based data relative to the network 106 and the network management system 100. A user interface 102 is coupled to the network management system 100 which allows a user, usually a network manager, to interface with the network management system 100. The user interface 102 includes a keyboard and display 107 and other appropriate input/output devices, e.g., a mouse or joystick 108.

[0017] The hardware for supporting a network management system as shown in FIG. 1 is typically a workstation, such as a Sun Model 3 or 4, or a PC compatible computer running Unix. Sufficient memory is required in order to run this system and may include 16 megabytes or more of memory along with a display device which supports the required color and resolution. The basic operating software which runs on the computer may support sockets, X-windows and/or the Open Software Foundation (OSF) Motif 1.0. The network management system in this embodiment is implemented using the C++ programming language; it could be implemented in another object-oriented language such as Smalltalk or Ada, or in another (non-object oriented) language such as C, Pascal, or COBOL. The network management system 100 may comprise more than one computer, where each computer is dedicated to a particular function involved in monitoring and/or controlling the network 106.

[0018] The present embodiment was developed for the Cabletron Spectrum® platform, although the solution may be applied to a variety of network management platforms.

[0019] The present invention determines the responsiveness of a second network device (i.e., a physical device or software application) by using one or more other devices and applications which attempt to contact the second device. A “proxy” or recruit network device is defined as a network device that can be used to assist in determining or analyzing another network device's communication capability. When a second network device is determined to have become incommunicado because of a loss of contact with a first network device, a recruit network device can be used to determine if the second network device can be communicated with, albeit along some different route and/or using a different protocol from that of the first device.

[0020] In this embodiment, recruit network devices “register” with a global recruiter. The global recruiter maintains a list of all recruit network devices, each recruit having a network unique identifier. The global recruiter may reside at one central location or be distributed across the network. The list is modified as devices enter and leave the network. Each time a new network device or application that can function as a recruit comes into the network, it registers itself with the global recruiter. The global recruiter does not need to know the specific protocol or means of communication of that recruit; it only needs to recognize the recruit network device's existence and have some way of communicating with the recruit.
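The registration behavior described above can be pictured with a minimal sketch. The class name `GlobalRecruiter`, its method names, and the `Probe` callable type are hypothetical and chosen only for illustration; the specification does not prescribe any particular data structure or interface.

```python
from typing import Callable, Dict, List

# Hypothetical type: some way of asking a recruit to test a target device.
# It takes the target's identifier and returns True if contact succeeded.
Probe = Callable[[str], bool]


class GlobalRecruiter:
    """Illustrative registry of recruits keyed by a network-unique identifier."""

    def __init__(self) -> None:
        self._recruits: Dict[str, Probe] = {}

    def register(self, recruit_id: str, probe: Probe) -> None:
        # Called when a recruit-capable device or application enters the network.
        self._recruits[recruit_id] = probe

    def unregister(self, recruit_id: str) -> None:
        # Called when the device leaves the network; unknown ids are ignored.
        self._recruits.pop(recruit_id, None)

    def list_recruits(self) -> List[str]:
        # Only identifiers are exposed; how each recruit probes stays opaque.
        return sorted(self._recruits)

    def probe_for(self, recruit_id: str) -> Probe:
        # The recruiter only needs some way of communicating with the recruit.
        return self._recruits[recruit_id]


recruiter = GlobalRecruiter()
recruiter.register("router-2", lambda target: False)
recruiter.register("device-1", lambda target: True)
recruiter.unregister("router-2")
```

Storing an opaque callable per recruit mirrors the point that the recruiter need not know which protocol a recruit uses.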

[0021] When a fault management service within the network management system 100 recognizes that a particular network device has gone down, i.e., cannot be communicated with, a request is sent to the global recruiter for a list of possible recruit network devices. The recruits actually used in a specific case may be all or only a subset of all possible recruits, depending upon the application of certain parameters for selecting from the list. Once the desired recruits are determined, a request can be sent to each recruit asking it to determine and/or verify the responsiveness of the particular down device. When called upon, a recruit can use its own specific means of verifying this responsiveness. The verification process used can be proprietary if necessary. Only the recruit network device needs to know how the process works. Neither the fault management service of the network management system nor the global recruiter needs to know the actual protocol and process being used. This allows a general purpose algorithm to have device and application models, i.e., recruits, implement very specific means of verifying communication.

[0022] The recruit network device may be considered a proxy agent for the fault management service since the recruit network device is asked by the fault management service to perform some function on its behalf, i.e., to contact the non-responsive or down device. At this point, it becomes the recruit's responsibility to test for responsiveness. Once tested, the recruit reports back to the fault management service the status and/or success of its attempt to communicate with the down device. The fault management service of the network management system then determines whether further analysis or action is required.

[0023] One advantage of this system is that the recruit may use a different communication method than that of the fault management service and/or network management system. A recruit network device may have an alternate path to the down device. It may also support a protocol different from that of the network management system, and it may have some proprietary knowledge that the fault management service lacks.

[0024] For example, one protocol which allows a device to make SNMP (Simple Network Management Protocol) requests of another device is the Distributed LAN Manager, DLM. DLM is available from Cabletron Systems, Inc., Rochester, N.H. A DLM management information base (MIB) enables a user to specify the queries and querying options desired. Any device that has a DLM application built-in can enlist with the global recruiter. The fault management service can then keep track of these applications and call upon them when needed. The DLM application can utilize an entry in a DLM MIB table to attempt to reach the down device.

[0025] If a particular network device appears to be responsive to a certain recruit network device then it may be valuable to keep track of that recruit network device for future use. Generally, the network management system does not lose responsiveness with only a single device. It is more common to lose contact with an entire group of devices and/or a subnet (i.e., a logical subset of the network). When this occurs, and a recruit network device is found to make contact with one of the down devices, it is often the case that the same recruit will be able to contact other devices that are affected by that fault. Keeping track of and reusing these previously successful recruit network devices saves time and network traffic.
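One way to keep track of previously successful recruits is sketched below. The `RecruitHistory` class, its per-subnet bookkeeping, and the identifiers used are assumptions made for illustration; the text states only that reusing such recruits saves time and traffic.

```python
from collections import defaultdict
from typing import Dict, List, Set


class RecruitHistory:
    """Hypothetical record of which recruits have reached devices in a subnet."""

    def __init__(self) -> None:
        # Maps a subnet name to the recruit ids that succeeded there.
        self._successes: Dict[str, Set[str]] = defaultdict(set)

    def record_success(self, recruit_id: str, subnet: str) -> None:
        self._successes[subnet].add(recruit_id)

    def prioritize(self, recruit_ids: List[str], subnet: str) -> List[str]:
        # Proven recruits move to the front; the sort is stable, so the
        # relative order of all other recruits is preserved.
        proven = self._successes.get(subnet, set())
        return sorted(recruit_ids, key=lambda rid: rid not in proven)


history = RecruitHistory()
history.record_success("device-1", "subnet-7")
ordered = history.prioritize(["router-2", "device-1", "router-3"], "subnet-7")
```

Trying proven recruits first means that, when an entire subnet is affected by one fault, the later devices in that subnet are usually tested with a single request each.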

[0026] When a developer is creating a device model based on an actual network device's capabilities, the developer may recognize that the actual network device has certain capabilities which enable it to be a proxy or recruit device, e.g., the ability to send management protocol commands or requests. As a result, the developer may create a model associated with this type of network device so that each time a device of this type is added to a network and, therefore, a model is added to the Cabletron Spectrum® System, it would be known that this device can function as a proxy or recruit. When this model is added to the network, it would know to register with the global recruiter and have itself put on a list of possible recruits.

[0027] FIG. 5 illustrates a Spectrum® user display 520 of the network management system 100 and the network 106. As shown in FIG. 5, network models 500-514 are shown as interconnected. Each model is an object (as in object-oriented programming), meaning that it contains data and operations relating to the device being modeled. Each model 500-514 corresponds to an actual network device with which the model is in contact. For example, model 504 represents a corresponding network device having the proxy or recruit capability defined in the model. As a result, the network management system 100 would be able to identify model 504, and its corresponding device, as a recruit network device. If, as an example, the model 506 were to lose contact with its corresponding network device, i.e., the network management system has lost contact with the device represented by model 506, the system would represent the model 506 (in display 520) with the color red. It might also be the case that each of models 508-514 has also lost contact with its corresponding network device due to the network topology in this example. These model representations would be displayed in grey. The model 506 is represented in red to indicate that, in the topology, this is the first device that is identified as no longer in communication. In other words, since the model 502 can contact its corresponding network device and the model 506, which is adjacent to model 502, cannot, the model 506 becomes a “border model.” The non-border models 508-514 are grey to indicate the management system has also lost contact with them. The network management system would then request that the recruit network device, represented by model 504, attempt to contact the network device represented by model 506. Model 504 may be able to contact the device represented by model 506 through the device represented by model 502. This information would be returned to the network management system 100 for analysis. As a result, the Spectrum® System may change the color of model 506 from red to orange, indicating that alternate means of communication are still available. Further, each of the remaining devices which had been indicated in grey, i.e., models 508-514, would be accessed by the recruit device model 504 to see if contact could be established. This information would also be reported back to the network management system 100. As each device is contacted, its color may be changed from grey to orange to indicate that alternate communication is still available.
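The color rules in this example can be summarized as a small decision function. This is only a sketch: the function name `display_color` and its three boolean inputs are hypothetical, and the label `"normal"` for a model that is still in contact is an assumption, since the text does not name that color.

```python
def display_color(nms_contact: bool, is_border: bool, recruit_contact: bool) -> str:
    """Return an illustrative display color for a device model."""
    if nms_contact:
        # Assumption: the text does not say how a healthy model is drawn.
        return "normal"
    if recruit_contact:
        # A recruit reached the device: alternate communication is available.
        return "orange"
    # No contact by any means: the border model is red, the others grey.
    return "red" if is_border else "grey"


# Model 506 before and after recruit 504 reports success.
before = display_color(nms_contact=False, is_border=True, recruit_contact=False)
after = display_color(nms_contact=False, is_border=True, recruit_contact=True)
```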

[0028] One concern that may arise is the amount of traffic generated from recruit network devices. If it is necessary to limit this traffic, parameters can be implemented to limit the number of recruit network devices that are utilized. For example, these parameters may result in choosing a limited number of different recruit network devices, each of which attempts a different protocol and/or path to determine the communication ability of the down device. These parameters may include, but are not limited to: imposing a limit as to the number of recruits used, e.g., only using X number of recruit network devices; using only those recruit network devices that employ a different communication path from that of the network management system; using only those recruit network devices that employ a different communication protocol than that of the network management system; using only those recruit network devices within the same subnet as the down device; preferentially selecting recruits that have previously successfully contacted the same device, or a device in the same logical workgroup or topological group; or using only those recruit network devices that are considered immediate neighbors of the down device. In this aspect, neighbor refers to devices that are physically connected to each other. In addition, a random subset of the list of recruit network devices may also be used.
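Several of the selection parameters listed above can be sketched as optional filters applied to the recruit list. The `Recruit` record, the `select_recruits` function, its keyword names, and the sample data (including `device-3` and the subnet names) are hypothetical; only the filtering criteria themselves come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Recruit:
    recruit_id: str
    path: str       # label of the path the recruit uses to reach the down device
    protocol: str   # label of the protocol the recruit uses
    subnet: str     # logical subset of the network the recruit belongs to


def select_recruits(
    candidates: List[Recruit],
    *,
    nms_path: Optional[str] = None,
    nms_protocol: Optional[str] = None,
    down_subnet: Optional[str] = None,
    max_recruits: Optional[int] = None,
) -> List[Recruit]:
    """Apply whichever limiting parameters were supplied, in order."""
    selected = list(candidates)
    if nms_path is not None:        # keep only recruits on a different path
        selected = [r for r in selected if r.path != nms_path]
    if nms_protocol is not None:    # keep only recruits using a different protocol
        selected = [r for r in selected if r.protocol != nms_protocol]
    if down_subnet is not None:     # keep only recruits in the down device's subnet
        selected = [r for r in selected if r.subnet == down_subnet]
    if max_recruits is not None:    # use at most X recruits
        selected = selected[:max_recruits]
    return selected


candidates = [
    Recruit("router-2", "path-A", "SNMP", "subnet-1"),
    Recruit("device-1", "path-C", "SNMP", "subnet-1"),
    Recruit("device-3", "path-C", "DLM", "subnet-2"),
]
chosen = select_recruits(candidates, nms_path="path-A", down_subnet="subnet-1")
```

A random subset, also mentioned in the text, could be obtained by shuffling the list before applying `max_recruits`.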

[0029] When the list of recruit network devices is determined for a down device, the recruit network devices may attempt to contact the down network device either serially or in parallel. In either case, each recruit network device would return the results of its attempted communication with the down network device, i.e., whether or not communication was successfully established.

[0030] A flowchart of one method embodiment is presented in FIG. 2, where the recruits attempt to contact the down device in series. In step 200, contact with device D is lost. In step 202, the parameters for the recruits are established, which may include, e.g., X number of recruits or only those recruits that use a different communication protocol, etc. In step 204, the list of recruits is retrieved. A first recruit is selected in step 206 and in step 208 this recruit is asked to contact device D and report back. In step 210, the report is received from the recruit regarding whether or not contact was successful. In step 211, it is determined whether the recruit was able to contact the down device D. If contact was successful, control passes to step 212 where the information from the recruit device is stored. At step 218, the information is processed by the fault analysis process. Since contact was established by a recruit, it may not be necessary to determine if there are any other recruits which can contact the device. Of course, the process can be modified to await a report from all recruits if such additional information is of value. At step 211, if the recruit was unsuccessful, control passes to step 214. In step 214, a determination is made if there are any more recruits in the list. If there are more recruits in the list, step 216 is executed, where the next recruit is selected, and control returns to step 208, where that recruit is requested to contact down device D. If there are no more recruits in the list in step 214, operation proceeds to the end. As shown in FIG. 2, the recruits serially, one after the other, attempt communication with down device D.
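The serial flow of FIG. 2 can be sketched as a simple loop that stops at the first successful report. The function `contact_in_series`, the `Probe` type, and the helper `make_probe` are hypothetical names; the step numbers in the comments refer to the flowchart as described above.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type: asks one recruit to contact a device, returns its report.
Probe = Callable[[str], bool]


def contact_in_series(
    recruits: Iterable[Tuple[str, Probe]], down_device: str
) -> Optional[str]:
    """Return the id of the first recruit that reaches down_device, else None."""
    for recruit_id, probe in recruits:      # steps 206 and 216: pick a recruit
        contacted = probe(down_device)      # steps 208 and 210: request, report
        if contacted:                       # step 211: was contact successful?
            return recruit_id               # step 212: keep the result for step 218
    return None                             # step 214: no more recruits in the list


calls: List[str] = []


def make_probe(name: str, result: bool) -> Probe:
    # Test helper: records which recruits were actually asked.
    def probe(target: str) -> bool:
        calls.append(name)
        return result
    return probe


recruit_list = [
    ("router-2", make_probe("router-2", False)),
    ("device-1", make_probe("device-1", True)),
    ("device-3", make_probe("device-3", True)),
]
winner = contact_in_series(recruit_list, "device-2")
```

Because the loop returns on the first success, `device-3` is never asked, which reflects the traffic saving that the serial embodiment allows.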

[0031] A flowchart of another method embodiment is shown in FIG. 3, where the recruits contact the down device D in parallel, i.e., at the same time. Steps 300-304 correspond to steps 200-204 in FIG. 2. In step 306, all recruits are requested to contact device D and report back. In step 308, the reports from the recruits are received. In step 310, the information received from a recruit is examined and, if contact was successful, the received information is processed in step 314. If not successful at step 310, control passes to step 312 and if there are more recruits to be heard from, control passes back to step 308 to await those reports.
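A minimal sketch of the parallel flow of FIG. 3 follows, using threads from the Python standard library as a stand-in for simultaneous requests to recruits. The function `contact_in_parallel` and the `Probe` type are hypothetical; the step numbers in the comments refer to the flowchart as described above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional

# Hypothetical type: asks one recruit to contact a device, returns its report.
Probe = Callable[[str], bool]


def contact_in_parallel(recruits: Dict[str, Probe], down_device: str) -> Optional[str]:
    """Ask all recruits at once; return the id of a recruit that succeeded, else None."""
    workers = max(1, len(recruits))  # the executor requires at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 306: request every recruit to contact the device and report back.
        pending = {pool.submit(probe, down_device): rid for rid, probe in recruits.items()}
        # Step 308: receive reports in whatever order they arrive.
        for future in as_completed(pending):
            if future.result():          # step 310: was this contact successful?
                return pending[future]   # step 314: hand the result to fault analysis
            # Step 312: otherwise keep waiting for the remaining recruits.
    return None


result = contact_in_parallel(
    {"router-2": lambda target: False, "device-1": lambda target: True},
    "device-2",
)
```

In this sketch the executor still waits for outstanding probes to finish when the function returns early; an implementation that needs to cut traffic short would have to cancel them explicitly.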

[0032] The information retrieved from the processes as shown in FIGS. 2 and 3 can be used to analyze the nature and effect of the fault. The processing from this point on, steps 218 and 314, is dependent on both the network management platform and the fault management service.

[0033] A simple example of the present method is set forth in FIG. 4. As shown in FIG. 4, both router-2 (122) and device-1 (128) have enlisted with the global recruiter (part of 100) as recruit network devices. The network management system 100 talks to router-1 (120), router-2 and device-2 (126) through path-A (130). In addition, the network management system 100 talks to router-1, router-3 (124) and device-1 through path-B (132). In the example, the link between router-2 and device-2 is broken and the network management system 100 can no longer communicate with device-2. The network management system 100 makes a request of the fault management service to identify whether or not device-2 is actually functioning. The fault management service retrieves the list of recruits that it should use for communication with device-2. In this simple example, there is no need to limit the list, so router-2 and device-1 are the recruit network devices in the list. The fault management service will ask the first recruit, router-2, to attempt contact with the device. As per the example, router-2 will report that it cannot communicate with device-2. Next, device-1 is requested to attempt communication with device-2. If the recruit device-1 uses path-C (134) to communicate with device-2, it will report back that it can communicate with device-2. This will identify to the network management system 100 that the failure does not lie with device-2, since device-1 has reported that it can communicate with device-2. This information can then be used to temporarily reroute data destined to and from device-2 and/or implement repair functions.
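The FIG. 4 scenario can be worked through with a small model in which each party uses a fixed path and a path works only if every link along it is intact. The node names follow the example; the representation of paths as node lists, the `"nms"` label, and the helper `path_is_up` are assumptions made for illustration.

```python
from typing import Dict, List, Set, FrozenSet

# The one failed link in the example: router-2 <-> device-2.
broken_links: Set[FrozenSet[str]] = {frozenset(("router-2", "device-2"))}


def path_is_up(path: List[str]) -> bool:
    """A path works only if none of its consecutive links is broken."""
    return all(frozenset(link) not in broken_links for link in zip(path, path[1:]))


# Path-A: how the network management system normally reaches device-2.
nms_path_a = ["nms", "router-1", "router-2", "device-2"]

# The path each registered recruit would use to reach device-2.
recruit_paths: Dict[str, List[str]] = {
    "router-2": ["router-2", "device-2"],   # shares the broken link
    "device-1": ["device-1", "device-2"],   # path-C
}

nms_can_reach = path_is_up(nms_path_a)
reports = {rid: path_is_up(path) for rid, path in recruit_paths.items()}
device_2_alive = any(reports.values())
```

Here the management system's own attempt fails, router-2 reports failure, and device-1 reports success, so the fault is attributed to the path rather than to device-2 itself.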

[0034] The fault management system of the present invention may be implemented as software on a floppy disk or hard drive which controls a computer, for example a general purpose computer such as a workstation, mainframe, or personal computer to perform the steps of the processes disclosed in FIGS. 2 and 3. Such a general purpose computer 70, as shown in FIG. 6, typically includes a central processing unit 72 (CPU) coupled to random access memory 74 (RAM) and program memory 76 via a data bus 78. The general purpose computer 70 may be connected to the network in order to receive reports and provide commands to devices on the network.

[0035] Alternatively, the invention may be implemented as special purpose electronic hardware. Additionally, in either a hardware or software embodiment, the functions performed by the different elements may be combined in various arrangements of hardware and software.

[0036] While the present embodiment was developed on a Cabletron Spectrum® system which uses models of network entities and models of relations which define relations between network entities, one of ordinary skill in the art can see that this method does not have to be run on such a system. While there have been shown and described certain embodiments of the present invention, it would be obvious to those skilled in the art that various changes and modifications may be made therein without departing from the scope of the invention as defined by the appended claims.

What is claimed is:
 1. A method of determining responsiveness of a network device in a network including a plurality of interconnected network devices, the method comprising the steps of: when a first network device loses contact with a second network device, the first network device requesting that a proxy network device contact the second network device and report back to the first network device whether the requested contact is successful.
 2. The method as recited in claim 1, wherein the proxy network device contacts the second network device along a second path, different from a first path used by the first network device to the second network device.
 3. The method as recited in claim 1, wherein the proxy network device uses a second communications protocol to contact the second network device, different from a first communications protocol used by the first network device.
 4. The method as recited in claim 1, wherein the proxy network device is on a same logical subset of the network as the second network device.
 5. The method as recited in claim 1, wherein the requesting step includes selecting from a list of potential proxy network devices at least one selected proxy network device to contact the second network device.
 6. The method as recited in claim 1, wherein the first network device requests that a plurality of proxy network devices contact the second network device.
 7. The method as recited in claim 6, wherein the plurality of proxy network devices attempt, in series, to contact the second network device.
 8. The method as recited in claim 6, wherein the plurality of proxy network devices attempt, in parallel, to contact the second network device.
 9. The method as recited in claim 1, wherein the proxy network device is a neighbor of the second network device.
 10. The method as recited in claim 1, further comprising the step of: analyzing the report from the proxy network device to determine a type of fault associated with the second network device.
 11. The method as recited in claim 1, further comprising the step of: determining an operative status of the second network device from a loss of contact with the second network device.
 12. The method recited in claim 1, wherein the first network device is a network management system.
 13. The method recited in claim 5, wherein the selecting step comprises selecting a proxy network device which has previously successfully contacted the second network device.
 14. The method recited in claim 5, wherein the selecting step comprises selecting a proxy network device which has previously successfully contacted a neighbor of the second network device.
 15. The method recited in claim 5, wherein the selecting step comprises selecting a proxy network device which has previously successfully contacted another network device.
 16. The method as recited in claim 5, wherein the selecting step is based on at least one of the following parameters: a) a communication path from the proxy network device to the second network device is different from a communication path from the first network device to the second network device; b) a communication protocol of the proxy network device is different from a communication protocol of the first network device; c) the proxy network device is in a same logical subset of the network as the second network device; and d) a total number of proxy network devices in the list is not greater than a predetermined number.
 17. A method for registering a proxy network device in a computer network including a plurality of interconnected network devices, the method comprising the steps of: determining whether each network device can function as a proxy network device; and maintaining a list of proxy network devices.
 18. Apparatus for determining the responsiveness of a network device in a network of interconnected network devices, the apparatus comprising: means for maintaining a list of proxy network devices; and means for selecting at least one proxy network device from the list of network devices when a first network device cannot establish contact with a second network device.
 19. The apparatus of claim 18, further comprising: means for requesting that the at least one selected proxy network device contact the second network device and report back to the first network whether the requested contact is successful.
 20. The apparatus of claim 19, further comprising: means for analyzing the report from the at least one proxy network device to determine a type of fault associated with the second network device.